NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Enabling Software Resilience in GPGPU Applications via Partial Thread Protection

https://doi.org/10.1109/ICSE43902.2021.00114

Yang, Lishan; Nie, Bin; Jog, Adwait; Smirni, Evgenia (May 2021, 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE))
null (Ed.)
Full Text Available
SUGAR: Speeding Up GPGPU Application Resilience Estimation with Input Sizing

https://doi.org/10.1145/3447375

Yang, Lishan; Nie, Bin; Jog, Adwait; Smirni, Evgenia (February 2021, Proceedings of the ACM on Measurement and Analysis of Computing Systems)
null (Ed.)
As Graphics Processing Units (GPUs) are becoming a de facto solution for accelerating a wide range of applications, their reliable operation is becoming increasingly important. One of the major challenges in the domain of GPU reliability is to accurately measure GPGPU application error resilience. This challenge stems from the fact that a typical GPGPU application spawns a huge number of threads and then utilizes a large amount of potentially unreliable compute and memory resources available on the GPUs. As the number of possible fault locations can be in the billions, evaluating every fault and examining its effect on theapplication error resilience is impractical. Application resilience is evaluated via extensive fault injection campaigns based on sampling of an extensive fault site space. Typically, the larger the input of the GPGPU application, the longer the experimental campaign. In this work, we devise a methodology, SUGAR (Speeding Up GPGPU Application Resilience Estimation with input sizing), that dramatically speeds up the evaluation of GPGPU application error resilience by judicious input sizing. We show how analyzing a small fraction of the input is sufficient to estimate the application resilience with high accuracy and dramatically reduce the duration of experimentation. Key of our estimation methodology is the discovery of repeating patterns as a function of the input size. Using the well-established fact that error resilience in GPGPU applications is mostly determined by the dynamic instruction count at the thread level, we identify the patterns that allow us to accurately predict application error resilience for arbitrarily large inputs. For the cases that we examine in this paper, this new resilience estimation mechanism provides significant speedups (up to 1336 times) and 97.0 on the average, while keeping estimation errors to less than 1%.
more » « less
Full Text Available
Practical Resilience Analysis of GPGPU Applications in the Presence of Single- and Multi-Bit Faults

https://doi.org/10.1109/TC.2020.2980541

Yang, Lishan; Nie, Bin; Jog, Adwait; Smirni, Evgenia (January 2021, IEEE Transactions on Computers)
null (Ed.)
Full Text Available
Characterizing Accuracy-Aware Resilience of GPGPU Applications

https://doi.org/10.1109/CCGrid49817.2020.00-82

Nie, Bin; Jog, Adwait; Smirni, Evgenia (May 2020, 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID))
null (Ed.)
Full Text Available
Mining Multivariate Discrete Event Sequences for Knowledge Discovery and Anomaly Detection

https://doi.org/10.1109/DSN48063.2020.00067

Nie, Bin; Xu, Jianwu; Alter, Jacob; Chen, Haifeng; Smirni, Evgenia (June 2020, Proceedings of the 50th IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020)
null (Ed.)
Modern physical systems deploy large numbers of sensors to record at different time-stamps the status of different systems components via measurements such as temperature, pressure, speed, but also the component's categorical state. Depending on the measurement values, there are two kinds of sequences: continuous and discrete. For continuous sequences, there is a host of state-of-the-art algorithms for anomaly detection based on time-series analysis, but there is a lack of effective methodologies that are tailored specifically to discrete event sequences. This paper proposes an analytics framework for discrete event sequences for knowledge discovery and anomaly detection. During the training phase, the framework extracts pairwise relationships among discrete event sequences using a neural machine translation model by viewing each discrete event sequence as a "natural language". The relationship between sequences is quantified by how well one discrete event sequence is "translated" into another sequence. These pairwise relationships among sequences are aggregated into a multivariate relationship graph that clusters the structural knowledge of the underlying system and essentially discovers the hidden relationships among discrete sequences. This graph quantifies system behavior during normal operation. During testing, if one or more pairwise relationships are violated, an anomaly is detected. The proposed framework is evaluated on two real-world datasets: a proprietary dataset collected from a physical plant where it is shown to be effective in extracting sensor pairwise relationships for knowledge discovery and anomaly detection, and a public hard disk drive dataset where its ability to effectively predict upcoming disk failures is illustrated.
more » « less
Full Text Available
Fault Site Pruning for Practical Reliability Analysis of GPGPU Applications

https://doi.org/10.1109/MICRO.2018.00066

Nie, Bin; Yang, Lishan; Jog, Adwait; Smirni, Evgenia (October 2018, 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO))

Graphics Processing Units (GPUs) have rapidly evolved to enable energy-efficient data-parallel computing for a broad range of scientific areas. While GPUs achieve exascale performance at a stringent power budget, they are also susceptible to soft errors, often caused by high-energy particle strikes, that can significantly affect the application output quality. Understanding the resilience of general purpose GPU applications is the purpose of this study. To this end, it is imperative to explore the range of application output by injecting faults at all the potential fault sites. This problem is especially challenging because unlike CPU applications, which are mostly single-threaded, GPGPU applications can contain hundreds to thousands of threads, resulting in a tremendously large fault site space - in the order of billions even for some simple applications. In this paper, we present a systematic way to progressively prune the fault site space aiming to dramatically reduce the number of fault injections such that assessment for GPGPU application error resilience can be practical. The key insight behind our proposed methodology stems from the fact that GPGPU applications spawn a lot of threads, however, many of them execute the same set of instructions. Therefore, several fault sites are redundant and can be pruned by a careful analysis of faults across threads and instructions. We identify important features across a set of 10 applications (16 kernels) from Rodinia and Polybench suites and conclude that threads can be first classified based on the number of the dynamic instructions they execute. We achieve significant fault site reduction by analyzing only a small subset of threads that are representative of the dynamic instruction behavior (and therefore error resilience behavior) of the GPGPU applications. Further pruning is achieved by identifying and analyzing: a) the dynamic instruction commonalities (and differences) across code blocks within this representative set of threads, b) a subset of loop iterations within the representative threads, and c) a subset of destination register bit positions. The above steps result in a tremendous reduction of fault sites by up to seven orders of magnitude. Yet, this reduced fault site space accurately captures the error resilience profile of GPGPU applications.
more » « less
Full Text Available
Fill-in the gaps: Spatial-temporal models for missing data

https://doi.org/10.23919/CNSM.2017.8255983

Xue, Ji; Nie, Bin; Smirni, Evgenia (November 2017, 13th International Conference on Network and Service Management, CNSM 2017)

Effective workload characterization and prediction are instrumental for efficiently and proactively managing large systems. System management primarily relies on the workload information provided by underlying system tracing mechanisms that record system-related events in log files. However, such tracing mechanisms may temporarily fail due to various reasons, yielding “holes” in data traces. This missing data phenomenon significantly impedes the effectiveness of data analysis. In this paper, we study real-world data traces collected from over 80K virtual machines (VMs) hosted on 6K physical boxes in the data centers of a service provider. We discover that the usage series of VMs co-located on the same physical box exhibit strong correlation with one another, and that most VM usage series show temporal patterns. By taking advantage of the observed spatial and temporal dependencies, we propose a data-filling method to predict the missing data in the VM usage series. Detailed evaluation using trace data in the wild shows that the proposed method is sufficiently accurate as it achieves an average of 20% absolute percentage errors. We also illustrate its usefulness via a use case.
more » « less
Full Text Available
Machine Learning Models for GPU Error Prediction in a Large Scale HPC System

https://doi.org/10.1109/DSN.2018.00022

Nie, Bin; Xue, Ji; Gupta, Saurabh; Patel, Tirthak; Engelmann, Christian; Smirni, Evgenia; Tiwari, Devesh (June 2018, 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN))

GPUs are widely deployed on large-scale HPC systems to provide powerful computational capability for scientific applications from various domains. As those applications are normally long-running, investigating the characteristics of GPU errors becomes imperative for reliability. In this paper, we first study the system conditions that trigger GPU errors using six-month trace data collected from a large-scale, operational HPC system. Then, we use machine learning to predict the occurrence of GPU errors, by taking advantage of temporal and spatial dependencies of the trace data. The resulting machine learning prediction framework is robust and accurate under different workloads.
more » « less
Full Text Available
Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities

https://doi.org/10.1109/MASCOTS.2017.12

Nie, Bin; Xue, Ji; Gupta, Saurabh; Engelmann, Christian; Smirni, Evgenia; Tiwari, Devesh (September 2017, 2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS))

GPUs have become part of the mainstream high performance computing facilities that increasingly require more computational power to simulate physical phenomena quickly and accurately. However, GPU nodes also consume significantly more power than traditional CPU nodes, and high power consumption introduces new system operation challenges, including increased temperature, power/cooling cost, and lower system reliability. This paper explores how power consumption and temperature characteristics affect reliability, provides insights into what are the implications of such understanding, and how to exploit these insights toward predicting GPU errors using neural networks.
more » « less
Full Text Available

Search for: All records